Zoltán Konyha, VRVis, konyha@vrvis.at [PRIMARY contact]
Andreas Ammer, VRVis, ammer@vrvis.at
Krešimir Matković, VRVis, matkovic@vrvis.at
Çağatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no
Denis Gračanin, Virginia Tech, gracanin@vt.edu
We have used our interactive, multiple linked views visualization
application ComVis in our analysis. ComVis can visualize scalar, categorical
and time series data in several different views. Each view is interactive and
brushable. Brushes defined in the same view or in different views can be combined
using boolean operators. The visual analysis context is captured in session
files. Exchanging session files facilitates better collaboration among members of
our team distributed in several cities.
We created a Python script to automate filtering and aggregation tasks.
We have chosen Python because it allows rapid prototyping. Actually, the script
evolved as part of our analysis and served as a powerful semi-automatic feature
extraction tool. By editing the Python code, we could flexibly compute any
aggregate data we required for the visual analysis in only a few minutes. This
became especially obvious when we required separate symptom statistics before
and after day 7 of the time period in the data set to estimate the number of
people infected. Overall, this approach was more efficient than trying to
implement similar custom aggregation methods in the C++ code of ComVis or
implementing a generic aggregation framework.
We used Microsoft Excel™ to create the bar chart in Figure 1.1 and to
compute the numbers of people infected in different cities.
ANSWERS:
MC2.1:
Analyze the records you have been given to characterize the spread of the
disease. You should take into
consideration symptoms of the disease, mortality rates, temporal patterns of
the onset, peak and recovery of the disease.
Health officials hope that whatever tools are developed to analyze this
data might be available for the next epidemic outbreak. They are looking for visualization tools that
will save them analysis time so they can react quickly.
Mortality rate
We first computed and examined various aggregates.
The overall mortality rate in each location was computed by dividing the number
of death records by the number of hospitalization records. We identified four
clusters in Figure 1.1:
Figure 1.1: Mortality rates in each
location. The average excludes "non-virus" locations.
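The mortality rate computation described above (death records divided by hospitalization records, per location) can be sketched as follows. This is only a minimal illustration, not the script we used: the function name `mortality_rates`, the location names "A" and "B", and all counts are made up for the example.

```python
from collections import Counter


def mortality_rates(hospitalizations, deaths):
    """Return deaths / hospitalizations for each location.

    Both arguments are iterables of location names, one entry per record.
    Locations without hospitalization records are skipped.
    """
    admitted = Counter(hospitalizations)
    died = Counter(deaths)
    return {loc: died[loc] / n for loc, n in admitted.items() if n > 0}


# Made-up records for two hypothetical locations.
hosp = ["A"] * 10 + ["B"] * 4
dead = ["A", "A", "B"]
rates = mortality_rates(hosp, dead)
print(rates)
```

Sorting the resulting dictionary by value gives the ordering used in a bar chart such as Figure 1.1.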
Temporal patterns in the virus outbreak
We computed two time series for each location:
Figure 1.2: Bottom left: the number of
deaths on each day. Right: the number of hospitalizations on each day (top) and
its 7-day moving average (bottom). Dates from 4-16-2009 (day 0) to 6-30-2009
(day 75) are displayed on the horizontal axes. Each curve in these diagrams
represents one location. An additional curve displays the global sum. The curve
specific to a location can be highlighted by brushing the location in the top
left histogram. One can brush "_GLOBAL_" to highlight the curves that
display global data.
We focus here on global patterns. Individual
locations are discussed in MC2.2. The number of deaths starts to increase rapidly
near day 15 (May 1). The peak of the epidemic (in terms of deaths) was on day 38
(May 24). The recovery phase lasted until approximately day 70 (June 25),
although this transition is not as clearly pronounced as the outbreak.
The time series of hospitalizations oscillates rapidly, so we applied a
seven-day moving average to smooth the curves. The number of
hospitalizations reached its maximum near day 31 (May 17). There are at least
two more local maxima, near day 51 (June 6) and day 72 (June 27). These likely
indicate the second and third waves of the epidemic. The epidemic cycle,
measured between those maxima, is about 21 days.
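The smoothing step can be illustrated with a short sketch. We assume a trailing window here (each value averages the current day and the six days before it); the text does not say whether the window was trailing or centered, and the function name and the sample series are made up.

```python
def moving_average(series, window=7):
    """Trailing moving average of a list of daily counts.

    The first window-1 entries are averaged over the days available so far,
    so the output has the same length as the input.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= series[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Made-up daily counts that oscillate between 7 and 0.
daily = [7, 0, 7, 0, 7, 0, 7, 0, 7, 0]
smooth = moving_average(daily)
print(smooth)
```

With a full window the oscillation between 0 and 7 shrinks to values between 3 and 4, which is the kind of damping that makes the local maxima readable.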
Number of days in hospital
We merged the hospitalization and death records for
each location using the patient IDs as primary keys (using the Python script).
For patients who died in hospital, the number of days between hospitalization
and death was computed, too. The histograms of this data are shown in Figure 1.3.
The histograms of
It would be interesting to learn how long it took
for surviving patients to recover and leave hospital, but that information is
not contained in the data set.
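The merge of hospitalization and death records on the patient ID can be sketched as below. This is an assumption-laden illustration: the record layout (one `{patient_id: date}` mapping per record type), the function name `days_to_death`, and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def days_to_death(admissions, deaths):
    """Join two {patient_id: date} mappings on the patient ID.

    Returns {patient_id: days between hospitalization and death} for every
    patient who appears in both mappings.
    """
    return {
        pid: (deaths[pid] - admitted).days
        for pid, admitted in admissions.items()
        if pid in deaths
    }


# Made-up records: patient 2 has no death record and is left out.
admissions = {1: date(2009, 5, 1), 2: date(2009, 5, 2), 3: date(2009, 5, 3)}
deaths = {1: date(2009, 5, 9), 3: date(2009, 5, 11)}

stay = days_to_death(admissions, deaths)
histogram = Counter(stay.values())  # number of patients per length of stay
print(stay, histogram)
```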
There is no significant correlation between
mortality and age or gender.
Figure 1.3: Histograms of the number
of days between hospitalization and death in
Symptoms
We edited the script to extract the most common
words found in the syndromes. For each word, the script computes the percentage
of dead and surviving patients whose records include the given symptom. This
word-level approach had several shortcomings, which became obvious when we
attempted to find patterns in the word frequencies: the word "and" was counted
as a symptom, and some words and their abbreviations appeared as separate
entries. We filtered out "and" and replaced some common abbreviations,
including "l" and "lt" -> "left", "r" and "rt" -> "right",
"abd" -> "abdominal", and "inj" -> "injury". The word "pain" is a very
generic symptom that often appears in expressions such as "back pain"
or "abdominal pain". The data extraction preserves those
combinations. We did not address several other issues, such as repeated words
("abd abd pain") or missing spaces ("vomitingdiarrhea").
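A minimal sketch of the word extraction described above, using the abbreviation replacements listed in the text. The function names and the sample records are hypothetical, and the sketch works on single words only; it does not reproduce the handling of combinations such as "back pain".

```python
from collections import Counter

# Replacements named in the text; the real list may have been longer.
ABBREVIATIONS = {
    "l": "left", "lt": "left",
    "r": "right", "rt": "right",
    "abd": "abdominal",
    "inj": "injury",
}
STOP_WORDS = {"and"}


def normalize(syndrome):
    """Lower-case, split on whitespace, expand abbreviations, drop "and"."""
    words = [ABBREVIATIONS.get(w, w) for w in syndrome.lower().split()]
    return [w for w in words if w not in STOP_WORDS]


def symptom_percentages(syndromes):
    """Percentage of records that mention each word at least once."""
    counts = Counter()
    for text in syndromes:
        # set() counts a repeated word ("abd abd pain") once per record.
        counts.update(set(normalize(text)))
    return {w: 100.0 * c / len(syndromes) for w, c in counts.items()}


# Made-up syndrome strings.
records = ["ABD PAIN and VOMITING", "rt leg inj", "vomiting", "back pain"]
pct = symptom_percentages(records)
print(pct)
```

Running the same function separately on the records of dead and surviving patients yields the two percentages per symptom that are plotted against each other.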
Figures 1.4 and 1.5 capture the process of
identifying the symptoms that appear most often in dead patients in
"virus" locations. Since over 94% of the dead patients are victims of
the virus, we assume those are symptoms of the infection.
Figure 1.4: Top left: symptoms are displayed
on the horizontal axis, the percentage of dead patients on the vertical axis. The
brush selects points that represent symptoms found in many of the dead
patients. Top right: each point represents one symptom. The X and Y coordinates
indicate the percentages of surviving and dead patients with this symptom.
Points above the diagonal represent symptoms that are more frequent among dead patients than among survivors.
Bottom left: each bar represents a location. Bottom right: each bar represents
a symptom.
Figure 1.5: We are not interested in
symptoms of the
The most characteristic virus symptoms and the
percentage of victims hospitalized with those symptoms can be inferred from the
highlighted items in Figure 1.5, bottom right and top left:
Also worth mentioning:
MC2.2:
Compare the outbreak across cities. Factors
to consider include timing of outbreaks, numbers of people infected and
recovery ability of the individual cities.
Identify any anomalies you found.
The numbers of hospitalizations and deaths vary across
locations, largely because of differences in population. To compare curves, the
time series computed for MC2.1 were normalized to a common scale by dividing
each series by its maximum.
The number of deaths was also normalized by the total
number of hospitalizations. These time series contain a lot of information
about the evolution of the epidemic because they preserve the differences in
mortality rates across locations:
We also computed the first derivative of D_maxDloc(d) (Figure 2.3,
top right).
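The two normalizations and the first derivative can be sketched as follows. The function names and sample numbers are made up, and the derivative is approximated by the day-to-day difference, which is an assumption about how it was computed.

```python
def normalize_by_max(series):
    """Scale a series so that its maximum becomes 1."""
    peak = max(series)
    return [v / peak for v in series] if peak else list(series)


def deaths_per_hospitalization(deaths, hospitalizations):
    """Daily deaths divided by the total number of hospitalizations."""
    total = sum(hospitalizations)
    return [d / total for d in deaths]


def first_difference(series):
    """Discrete first derivative: change from one day to the next."""
    return [b - a for a, b in zip(series, series[1:])]


# Made-up daily counts for one hypothetical location.
deaths = [0, 1, 3, 4, 2]
hospitalizations = [10, 20, 40, 20, 10]

scaled = normalize_by_max(deaths)
rate = deaths_per_hospitalization(deaths, hospitalizations)
slope = first_difference(deaths)
print(scaled, rate, slope)
```

Dividing by the maximum forces every curve to peak at 1, which magnifies low-count locations; dividing by the total number of hospitalizations keeps locations with few deaths close to zero.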
Figure 2.1: Top left: Each bar
represents a location. "Virus" locations (red) and the global sum
(blue) are brushed. "Non-virus" locations are grey. Top right: number
of deaths per day. Bottom left: normalization brings the curves to the same scale.
The grey curves that belong to "non-virus" locations oscillate so
wildly because they are heavily magnified by the normalization. Bottom right:
number of deaths per day divided by the total number of hospitalizations. The
grey curves stay close to zero.
Outbreak
We first tried to find the date of the virus outbreak
in each location. This requires a definition of the outbreak. Our
approach is to examine the daily mortality rates in "virus" locations
and find when they exceed the maximum in "non-virus" locations
(Figure 2.2 with
The vast majority of victims die after eight days in
hospital. We therefore believe the first infected people were hospitalized eight
days before the mortality rate started to increase. We take those dates as the
dates of the outbreak.
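The outbreak criterion can be written down as a short sketch: find the first day on which the daily mortality rate of a location exceeds a threshold, then step back eight days. The function name, the threshold value and the sample rates are made up; day 0 is April 16, 2009, as in Figure 1.2.

```python
from datetime import date, timedelta

DAY_ZERO = date(2009, 4, 16)  # day 0 of the data set
LAG_DAYS = 8                  # days between hospitalization and death


def outbreak_date(daily_rates, threshold, lag=LAG_DAYS):
    """Date `lag` days before the daily rate first exceeds `threshold`.

    Returns None if the rate never exceeds the threshold.
    """
    for day, rate in enumerate(daily_rates):
        if rate > threshold:
            return DAY_ZERO + timedelta(days=day - lag)
    return None


# Made-up rates: flat for 17 days, then rising on day 17 (May 3).
rates = [0.001] * 17 + [0.004, 0.006]
# Hypothetical maximum daily rate observed in "non-virus" locations.
threshold = 0.002

start = outbreak_date(rates, threshold)
print(start)
```

Using a lower threshold, such as a mean instead of a maximum, makes the first crossing happen earlier and shifts the estimated date accordingly.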
Figure 2.2: Top left: "non-virus"
locations are brushed in red.
Other criteria (such as the mean daily mortality
rate in "non-virus" locations) could also define the start of the
epidemic. That would shift the outbreak dates (Table 2.1) two or three days earlier.
Peaks and outliers
Using a procedure very similar to the one described
above, we selected individual locations and looked for the maxima
in the linked daily mortality rate curves. They indicate the peaks of the
epidemic in terms of the number of victims. Some curves have several pronounced
local maxima, so we examined the first derivatives. The snapshot of this process
(Figure 2.3) captures some interesting outliers.
Figures 2.4 and 2.5 show two clusters in the shapes
of the daily mortality rates.
Figure 2.3: Top right: first derivative
of the mortality rate with outliers brushed. Bottom left: normalized number of
hospitalizations. Bottom right: number of deaths per day divided by the total
number of hospitalizations.
Figure 2.4: The meaning of the views
is the same as in Figure 2.3. The selection in the top right is different. Curves
that cross the line brush (thick black line) are selected. The shapes of the
mortality rate curves (bottom right) in
Figure 2.5: Similar to Figure 2.4, but
curves that do not cross the same brush are selected.
Number of people infected
We can only infer the number of people hospitalized
because of the virus infection. There is no information about the infected
population.
Our calculations are based on the numbers of
patients with the most characteristic symptoms. We found approximately the same
frequency of symptoms in all locations (virus and non-virus) during the first seven days:
We used those numbers as reference values before the
epidemic. The number of patients hospitalized with vomiting during the first
seven days can be extrapolated to estimate how many would have been
hospitalized in the next 69 days, had there been no epidemic. The difference
between the actual number after day seven and this estimate is the number of
patients hospitalized with vomiting because of the epidemic. Let n denote that
number. We know p, the fraction of infected patients with the symptom
"vomiting" after the virus outbreak. The number of people hospitalized with
virus infection is then n/p. We get similar results with the other two symptoms.
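The estimate can be followed step by step in a small sketch. All counts and the value of p below are made up purely to show the arithmetic, and the function name is hypothetical; they are not values from the data set.

```python
def estimate_infected(count_first_week, count_after, p,
                      baseline_days=7, remaining_days=69):
    """Estimate the number of people hospitalized with the virus infection.

    count_first_week: patients with the symptom during the first seven days
    count_after:      patients with the symptom in the remaining 69 days
    p:                fraction of infected patients who show the symptom
    """
    # Expected count after day seven had there been no epidemic.
    expected = count_first_week * remaining_days / baseline_days
    # Excess hospitalizations attributed to the epidemic (n in the text).
    n = count_after - expected
    return n / p


# Made-up numbers: 70 cases in the first week, 1690 afterwards, p = 0.5.
infected = estimate_infected(70, 1690, 0.5)
print(infected)
```

Repeating the calculation with each characteristic symptom and comparing the results gives a rough consistency check on the estimate.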
We similarly estimated the increase in number of
deaths because of the epidemic and computed the mortality rate of the epidemic
from the number of deaths related to the epidemic and the number of people
hospitalized because of the epidemic (Table 2.1).
| | Increase in | | | Number of | Mortality rate |
| | May 3 | Apr 25 | May 23 | 556586 | 14.0% |
| | May 6 | Apr 28 | May 28 | 137727 | 11.7% |
| | May 5 | Apr 27 | May 27 | 80340 | 14.7% |
| | May 4 | Apr 26 | May 25 | 1211790 | 13.6% |
| | May 3 | Apr 25 | May 25 | 37920 | 19.8% |
| | May 1 | Apr 23 | May 22 - May 23 | 328862 | 13.2% |
| | May 4 | Apr 26 | May 25 - May 26 | 152004 | 14.0% |
| | May 4 | Apr 26 | May 27 | 33286 | 11.1% |
| | May 3 | Apr 25 | May 24 - May 25 | 63451 | 12.0% |
Table 2.1: Summary of the virus outbreak across cities.